Text File | 1992-03-17 | 10KB | 215 lines
Data Bank Compression Programs
There are two compression programs: CPRESS and HPRESS. CPRESS is the
original PRESS program, while HPRESS compresses a G workspace bank directly
into a hashed bank. Hashed banks are now the standard data banks that
I.E.R.F. creates and maintains. The compression routines for hashed and
compressed banks are essentially the same; with the hashed banks, however,
we have added a far superior method of indexing the data series.
Since the compression of both types of banks is so similar, the syntax
and usage for both compression programs are essentially the same as well.
Therefore, what follows assumes one is using HPRESS. If you wish to use
the original compression program, simply substitute CPRESS where HPRESS
appears and read 'compress,' where 'hash' appears below.
The HPRESS data bank compression program can generally reduce the size
of a G data bank by at least a factor of 2 and sometimes by much more.
The banks which it produces can be used directly in G and PDG 5.05 and above.
The general invocation of HPRESS is:
hpress <original_bank> [hashed_bank] [-m<missing>] [-s<maxslash>]
As usual, the <> enclose required items; [ ], optional items.
Here are several examples:
hpress nipaq
hpress nipaq e:nipaq
hpress nipaq -m-999999
hpress nipaq -s2
hpress nipaq e:nipaq -m-999999 -s4
If no hashed bank is named, as in the first example, the original bank
name, but with different extensions, is used for the output bank. If the data
source has used some unusual value, such as -999999, to indicate a missing
value, these observations may be turned to zeroes using the -m option, as in
the third and last examples. The meaning of the "maxslash" option is
explained below.
Banks which have been pressed by HPRESS have the extensions ".hin" and ".hbk"
for their index and data files, respectively. To assign a hashed bank
in G or PDG, the command is simply
hbk <bank_name>
for example
hbk nipaq
assigns the hashed quarterly NIPA bank. All other commands should then
work exactly as with any other assigned bank. It is NOT possible to assign a
hashed bank as a workspace with the wsb command. It IS possible to assign
a hashed bank as the initial bank in the g.cfg file.
HPRESS does two things to compress each series.
1. Leading and trailing zeroes are removed from the series.
2. Whenever possible, a series is represented mainly by 2-byte
integers rather than by 4-byte floating point numbers.
To accomplish the second step, the number of decimal places in each series is
found, the decimal point is slid to the right that many places, and the
result is expressed as a 4-byte integer. (If some number is too big to be
expressed as a 4-byte integer, the number of decimal places is reduced and the
process repeated.) First differences of the (non-zero) values are then
calculated and checked to see whether each can be expressed as a two-byte
integer. If all of them pass, the series is stored by recording the starting
date, frequency, number of observations, number of decimal places, and
starting observation as a four-byte integer, followed by the first
differences as two-byte integers. If this compression fails, then either
(a) the numbers are stored as four-byte floating-point numbers or (b) the
differences are divided by a power of 2 up to a maximum of "maxslash" as
given by the -s option.
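The two-byte test described above can be sketched in C as follows. This is
only an illustration, not the HPRESS source: the function name
fits_two_byte_diffs, the use of double for the input values, and the
rounding rule are all assumptions, and the retry with fewer decimal places
is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from HPRESS): slide the decimal point
   `places` digits to the right, round to an integer, and report whether
   every value fits in 4 bytes and every first difference in 2 bytes.
   Returns 1 if the series passes, 0 otherwise. */
int fits_two_byte_diffs(const double *x, size_t n, int places)
{
    double scale = 1.0;
    int64_t prev = 0;
    size_t i;
    int k;

    assert(places >= 0);
    for (k = 0; k < places; k++)
        scale *= 10.0;

    for (i = 0; i < n; i++) {
        double v = x[i] * scale;
        int64_t cur;

        /* Too big for a 4-byte integer: the caller should retry
           with fewer decimal places. */
        if (v > (double)INT32_MAX || v < (double)INT32_MIN)
            return 0;
        /* Assumed rounding rule: half away from zero. */
        cur = (int64_t)(v >= 0.0 ? v + 0.5 : v - 0.5);

        if (i > 0) {
            int64_t d = cur - prev;   /* first difference */
            if (d < INT16_MIN || d > INT16_MAX)
                return 0;
        }
        prev = cur;
    }
    return 1;
}
```

For instance, the values 100.1, 100.3, 99.8 with one decimal place become
1001, 1003, 998, whose differences (2 and -5) easily fit in two bytes.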
What precision is possible in a compressed series? A laser printer typically
prints 300 dots per inch. The precision of the first differences in a
compressed series is comparable to one such dot in a graph nine feet high!
Although all series in the US quarterly national accounts from 1947 to 1988
compress with complete accuracy, about five percent of the series in the Blue
Pages of the Survey of Current Business fail, and about ten percent of the
series in the International Financial Statistics fail to compress. This
failure occurs when series are being carried to six or seven significant
figures in the sources. Obviously, this much accuracy is seldom of any value
in economic use of the series. If compaction is quite important, you may
therefore want to compress these series at a minor cost in terms of accuracy.
To do so, use the -s option on the command line to set "maxslash", the maximum
power of 2 which will be used to divide the differences to get them down to
the size which can be expressed as a two-byte integer. Obviously, the slash
value actually used for each series is stored with the series and is used by G
in interpreting the compressed series. A file called "forced" is created with
each compression. It lists all series either slashed (marked "forced") or
dumped as four-byte floating point numbers (marked "gave up"). This file has
the form of an add file for G to draw graphs of the original and compressed
series. The default value of maxslash is 0; that is, only compression with
perfect accuracy is allowed. However, values of maxslash as high as 4 have not led to
graphically distinguishable series. (Files on ECONDATA, the replacement of
MECCA, are compressed with maxslash = 0.)
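The search for a slash value can be sketched in C as below. The sketch
assumes that maxslash is the exponent of the power of 2 (so maxslash = 4
allows division by up to 16), that the quotient is simply truncated, and
that the smallest workable exponent is chosen; none of these details is
stated in this document. The name find_slash is hypothetical, and the limit
of 15 is inferred from the series header, which packs the slash into half a
byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the smallest exponent s in 0..maxslash for
   which every difference divided by 2^s fits in a two-byte integer, or
   -1 if none does (the "gave up" case). */
int find_slash(const int32_t *diffs, size_t n, int maxslash)
{
    int s;
    size_t i;

    assert(maxslash >= 0 && maxslash <= 15);
    for (s = 0; s <= maxslash; s++) {
        int32_t divisor = (int32_t)1 << s;
        int ok = 1;

        for (i = 0; i < n; i++) {
            int32_t q = diffs[i] / divisor;  /* truncation is an assumption */
            if (q < INT16_MIN || q > INT16_MAX) {
                ok = 0;
                break;
            }
        }
        if (ok)
            return s;
    }
    return -1;
}
```

With maxslash = 0 the function accepts only series whose differences
already fit, which corresponds to the default of perfect accuracy.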
Even in the case of failure in compression, the elimination of leading and
trailing zeroes often reduces the size of the bank. Also, the organization
of the index file in a hashed bank greatly speeds up the series searching
process in PDG, among others.
The easiest way to update a hashed bank is by using HSPLICE. There is
another (and, unfortunately, more tedious) way to update a hashed bank. For
example, to update a hashed bank, say nipaq.hbk, from the file newnipq.hbk,
the steps are as follows:
1. Run the BUPS program (2.0) included in the PDG 5.1 package:
bups nipaq
This will create the ascii file nipaq.bup.
2. Start G or PDG and select option 'a' on the opening menu, and set the
starting date and size of the updated bank. Then do:
hbk nipaq
add nipaq.bup
hbk newnipq
add nipaq.bup
q
3. The workspace bank of G or PDG is now the updated bank. Rename it. If it
was named ws and we wanted to rename it upnipaq, then do
ren ws.* upnipaq.*
4. If you wish to compress it, do
hpress upnipaq
When you have checked that all is well in the updated bank, you may, of
course, wish to rename it to nipaq.
**********************************
** Note for Programmers **
**********************************
The precise form of the hashed bank .hin and .hbk files is as follows:
The ".hin" file contains:
   item               size in bytes       C type
   ==============================================================
   ns                 4                   long
   nbins              2                   unsigned
   nsb                2*nbins             unsigned
   ncharb             2*nbins             unsigned
   posbin             4*nbins             unsigned
   binname(0)         ncharb[0]           char
   binposts(0)        4*nsb[0]            long
   binname(1)         ncharb[1]           char
   binposts(1)        4*nsb[1]            long
   binname(2)         ncharb[2]           char
   binposts(2)        4*nsb[2]            long
     .
     .
   binname(nbins-1)   ncharb[nbins-1]     char
   binposts(nbins-1)  4*nsb[nbins-1]      long
Here, ns denotes the number of series in the bank. The series are
separated into "bins". The number of bins in the bank is denoted by
nbins. The number of series in each bin is given by the array nsb.
The sum of the number of characters in the names (including each '\0')
of the series contained in each bin is given by the array ncharb.
The beginning positions in the ".hin" file of the first bytes of the
binname() strings are given by the array posbin. The string binname(i)
denotes the concatenation (including the \0's) of all the series names
in the i-th bin. Finally, binposts(i) denotes the array of beginning
positions in the associated ".hbk" file of the series in the i-th bin.
Of course, the ordering of the series in the binname() and binposts()
arrays must be the same.
Consider an example. Suppose that bin number 3 contains the series
"joe", "dave", and "bill". The string binname(3) would be
"joe\0dave\0bill\0"
Suppose that the starting positions in the ".hbk" bank for the three
series are 40700008, 490987, 3378294. The array binposts(3) would then
be [40700008, 490987, 3378294]. And nsb[3] = 3, and ncharb[3] = 14.
If the beginning position of binname(3) in the ".hin" file is 4724, then
posbin[3] = 4724.
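To make the pairing of binname() and binposts() concrete, here is a small C
sketch of searching one bin. The function name find_in_bin is hypothetical;
it assumes the bin's name string and its nsb count have already been read
from the ".hin" file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: `names` holds `nsb` null-terminated series names
   stored back to back, as in binname(i). Return the index of `name`
   within the bin, or -1 if it is not there. The same index selects the
   series position in binposts(i). */
int find_in_bin(const char *names, unsigned nsb, const char *name)
{
    unsigned i;

    for (i = 0; i < nsb; i++) {
        if (strcmp(names, name) == 0)
            return (int)i;
        names += strlen(names) + 1;   /* skip this name and its '\0' */
    }
    return -1;
}
```

In the example above, searching for "dave" gives index 1, so the series
begins at binposts(3)[1] = 490987 in the ".hbk" file.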
To assign a bin number to a series you must use the following hashing
routine. In C, the routine is:
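    extern unsigned nbins;   /* number of bins; assumed defined elsewhere */

    unsigned hash(char *s)
    {
        unsigned bill;
        /* build up a base-31 sum over the characters of the name */
        for (bill = 0; *s != '\0'; s++) bill = *s + 31*bill;
        bill = bill % nbins;   /* reduce to a bin number, 0..nbins-1 */
        return(bill);
    }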
To continue with the example, to determine which bin the series "joe"
really belongs to, you would evaluate the function hash("joe").
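One caution when reimplementing the routine on a modern compiler: the
".hin" table lists unsigned items as 2 bytes, which suggests the original
arithmetic wrapped at 16 bits. The sketch below makes that assumption
explicit. The name hash16, the nbins parameter, and the treatment of
characters as unsigned are all assumptions, not part of the original
routine.

```c
#include <stdint.h>

/* Hypothetical portable variant of hash(): the running sum is kept in
   16 bits so it wraps as a 2-byte unsigned would. nbins is passed in
   rather than read from a global, and must be nonzero. */
uint16_t hash16(const char *s, uint16_t nbins)
{
    uint16_t h = 0;

    for (; *s != '\0'; s++)
        h = (uint16_t)((unsigned char)*s + 31u * h);
    return (uint16_t)(h % nbins);
}
```

As a check, with a made-up bin count of 101, hash16("joe", 101) is 78,
whereas the same loop with a 32-bit unsigned would give 65. A reader built
with the wrong width would look in the wrong bin.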
The .hbk file:
   0 - 79       char    Name of bank (terminated with a null)
   80 - 81      int     ns, number of series in the bank
   82 - 85      long    psn, position in file of index
   86 -                 first series, as described below
   *(psn+1) -           second series,
   ...                  ...
   psn          long    position in file of first byte of first series
   psn+4        long    position in file of first byte of second series
   ...          ...     on out to ns series
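As an illustration of using the index at psn, the following C sketch reads
series positions from a ".hbk" file that has been loaded into memory. It
assumes little-endian byte order, as on the PC, and treats positions as
unsigned; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 2-byte little-endian value (byte order is an assumption). */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 4-byte little-endian value (byte order is an assumption). */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: number of series, from bytes 80-81. */
uint16_t hbk_ns(const unsigned char *img)
{
    return le16(img + 80);
}

/* Hypothetical helper: position of the i-th series (i = 0 is the first),
   read from the index that starts at psn (bytes 82-85). */
uint32_t hbk_series_pos(const unsigned char *img, uint16_t i)
{
    uint32_t psn = le32(img + 82);

    return le32(img + psn + 4u * (size_t)i);
}
```

The caller is responsible for checking that i is less than ns and that the
returned position lies inside the file.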
For each series, the format is:
   byte    Content
   0       base year
   1       frequency*16+period
   2       slash*16+maxplaces or 255 if not compressed
   3-4     number of observations
   5-8     first observation as a long
   9 -     differences as integers
           if not compressed, floats begin in byte 5
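The packed header bytes can be taken apart as in the C sketch below, which
also rebuilds the values of a compressed series. Several things here are
assumptions rather than facts from this document: little-endian byte order,
nobs-1 stored differences, each difference being multiplied by 2^slash
before it is added back, and "maxplaces" being the number of decimal places
used. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int base_year;    /* byte 0, kept exactly as stored */
    int frequency;    /* high 4 bits of byte 1 */
    int period;       /* low 4 bits of byte 1 */
    int compressed;   /* 0 if byte 2 is 255 */
    int slash;        /* high 4 bits of byte 2 */
    int places;       /* low 4 bits of byte 2 */
    unsigned nobs;    /* bytes 3-4 */
} SeriesHeader;

/* Signed 2-byte little-endian read (byte order is an assumption). */
static int32_t le_i16(const unsigned char *p)
{
    int32_t u = p[0] | (p[1] << 8);
    return u >= 0x8000 ? u - 0x10000 : u;
}

/* Signed 4-byte little-endian read (byte order is an assumption). */
static int64_t le_i32(const unsigned char *p)
{
    int64_t u = (int64_t)p[0] | ((int64_t)p[1] << 8) |
                ((int64_t)p[2] << 16) | ((int64_t)p[3] << 24);
    return u >= 0x80000000LL ? u - 0x100000000LL : u;
}

void parse_header(const unsigned char *p, SeriesHeader *h)
{
    h->base_year  = p[0];
    h->frequency  = p[1] >> 4;
    h->period     = p[1] & 0x0F;
    h->compressed = (p[2] != 255);
    h->slash      = h->compressed ? p[2] >> 4 : 0;
    h->places     = h->compressed ? p[2] & 0x0F : 0;
    h->nobs       = (unsigned)(p[3] | (p[4] << 8));
}

/* Expand a compressed series into out[] (at most max values). Returns
   the number of values written, or 0 if the series is stored as floats. */
size_t expand_series(const unsigned char *p, double *out, size_t max)
{
    SeriesHeader h;
    double scale = 1.0;
    int64_t acc;
    size_t i, n;
    int k;

    parse_header(p, &h);
    if (!h.compressed)
        return 0;
    for (k = 0; k < h.places; k++)
        scale *= 10.0;
    n = h.nobs < max ? h.nobs : max;
    if (n == 0)
        return 0;

    acc = le_i32(p + 5);                 /* first observation */
    out[0] = (double)acc / scale;
    for (i = 1; i < n; i++) {
        /* assumed: undo the slash by multiplying by 2^slash */
        acc += (int64_t)le_i16(p + 9 + 2 * (i - 1)) * ((int64_t)1 << h.slash);
        out[i] = (double)acc / scale;
    }
    return n;
}
```

A series stored with slash = 0 expands exactly; with a nonzero slash the
rebuilt values differ from the originals by whatever was lost in the
division.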